{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Truncated Singular Value Decomposition (TSVD) \n",
    "The TSVD algorithm is a linear dimensionality reduction algorithm that works really well for datasets in which samples correlated in large groups. Unlike PCA, TSVD does not center the data before computation. \n",
    "\n",
    "The model can take array-like objects, either in host as NumPy arrays or in device (as Numba or cuda_array_interface-compliant), as well as cuDF DataFrames as the input. \n",
    "\n",
    "For information on converting your dataset to cuDF format, refer to the documentation: https://rapidsai.github.io/projects/cudf/en/0.11.0/\n",
    "\n",
    "For information on cuML's TSVD implementation: https://rapidsai.github.io/projects/cuml/en/0.11.0/api.html#truncated-svd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "import pandas as pd\n",
    "import cudf as gd\n",
    "\n",
    "from cuml.datasets import make_blobs\n",
    "\n",
    "from sklearn.decomposition import TruncatedSVD as skTSVD\n",
    "from cuml.decomposition import TruncatedSVD as cumlTSVD"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_samples = 2**15\n",
    "n_features = 128\n",
    "\n",
    "n_components = 2\n",
    "random_state = 42"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Data\n",
    "\n",
    "### GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "device_data, _ = make_blobs(\n",
    "   n_samples=n_samples, n_features=n_features, centers=1, random_state=7)\n",
    "\n",
    "device_data = gd.DataFrame.from_gpu_matrix(device_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Host"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "host_data = device_data.to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scikit-learn Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "tsvd_sk = skTSVD(n_components=n_components,\n",
    "                 algorithm=\"arpack\", \n",
    "                 n_iter=5000,\n",
    "                 tol=0.00001,\n",
    "                 random_state=random_state)\n",
    "\n",
    "result_sk = tsvd_sk.fit_transform(host_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## cuML Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "tsvd_cuml = cumlTSVD(n_components=n_components,\n",
    "                     algorithm=\"full\", \n",
    "                     n_iter=50000,\n",
    "                     tol=0.00001,\n",
    "                     random_state=random_state)\n",
    "\n",
    "result_cuml = tsvd_cuml.fit_transform(device_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate Results\n",
    "\n",
    "### Singular Values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "passed = np.allclose(tsvd_sk.singular_values_, \n",
    "                     tsvd_cuml.singular_values_.to_array(), \n",
    "                     atol=0.01)\n",
    "print('compare tsvd: cuml vs sklearn singular_values_ {}'.format('equal' if passed else 'NOT equal'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Components"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "passed = np.allclose(tsvd_sk.components_, \n",
    "                     np.asarray(tsvd_cuml.components_.as_gpu_matrix()), \n",
    "                     atol=1e-2)\n",
    "print('compare tsvd: cuml vs sklearn components_ {}'.format('equal' if passed else 'NOT equal'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Transform"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# compare the reduced matrix\n",
    "passed = np.allclose(result_sk, np.asarray(result_cuml.as_gpu_matrix()), atol=0.2)\n",
    "# larger error margin due to different algorithms: arpack vs full\n",
    "print('compare tsvd: cuml vs sklearn transformed results %s'%('equal'if passed else 'NOT equal'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
